
Simplify approach for creating dag run and task spans#62554

Open
dstandish wants to merge 39 commits into apache:main from astronomer:minimize-otel-code

Conversation

@dstandish
Contributor

@dstandish dstandish commented Feb 27, 2026

Replace otel spans tracking implementation with something much simpler

  • rip out existing spans
  • add back the dag run span
  • add back a task span (this is handled in the runner)
    • the detailed task spans will be done in a followup
  • rip out Trace, _TraceMeta, DebugTrace "classes" --> later
  • remove the "Tracer" protocol and EmptyTrace --> later
    • we don't need to keep them for backcompat because they were never really used
    • EmptyTrace was returned when otel was not enabled, but it should be fine to lean on the normal otel behavior of creating non-recording spans when otel is not enabled; the updates to OtelTrace should accomplish this
  • deprecate the OtelTrace class
    • update its methods so that it does as little as possible but doesn't break dags for any unfortunate souls who decided to use it
  • re-implement the configuration bit which looks at our deprecated otel config vars and the env vars supported by otel
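The "lean on non-recording spans" point can be illustrated with a minimal, self-contained sketch of the no-op span pattern. This is not Airflow's or the OTel SDK's code; the class and method names below are hypothetical stand-ins showing why callers don't need an EmptyTrace fallback when the tracer itself hands out inert spans:

```python
from contextlib import contextmanager

class NoOpSpan:
    """Accepts the span API but records nothing, mirroring how the
    OpenTelemetry API hands out non-recording spans when no SDK
    tracer provider is configured."""
    def is_recording(self):
        return False
    def set_attribute(self, key, value):
        pass
    def set_status(self, status):
        pass
    def end(self):
        pass

class RecordingSpan(NoOpSpan):
    """A trivially 'real' span, used only when tracing is enabled."""
    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.ended = False
    def is_recording(self):
        return True
    def set_attribute(self, key, value):
        self.attributes[key] = value
    def end(self):
        self.ended = True

class Tracer:
    def __init__(self, enabled: bool):
        self.enabled = enabled

    @contextmanager
    def start_as_current_span(self, name):
        # Callers use the exact same code path either way; the disabled
        # case simply yields a span that drops everything on the floor.
        span = RecordingSpan(name) if self.enabled else NoOpSpan()
        try:
            yield span
        finally:
            span.end()

with Tracer(enabled=False).start_as_current_span("dag_run") as span:
    span.set_attribute("dag_id", "example")  # silently dropped
    print(span.is_recording())  # False
```

Because the caller never branches on whether otel is enabled, the Tracer protocol and EmptyTrace return value become unnecessary, which is the simplification the PR is after.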

@boring-cyborg boring-cyborg bot added area:DAG-processing area:Executors-core LocalExecutor & SequentialExecutor area:Scheduler including HA (high availability) scheduler area:task-sdk area:Triggerer labels Feb 27, 2026
@xBis7 xBis7 mentioned this pull request Feb 27, 2026
2 tasks
@dstandish dstandish force-pushed the minimize-otel-code branch 3 times, most recently from 74b3942 to 236e5ee on March 3, 2026 16:40
@dstandish dstandish changed the title WIP - Minimize otel code Simplify approach for creating dag run and task spans Mar 3, 2026
@dstandish dstandish force-pushed the minimize-otel-code branch 2 times, most recently from a1cb165 to a85b5eb on March 5, 2026 03:29
Contributor

@xBis7 xBis7 left a comment

@dstandish Great job! I left some comments but generally it looks good and I think it can be converted to an actual PR.

Additionally, I tested it manually and I ran a dag with long timeouts. I didn't see any issues.

@dstandish dstandish force-pushed the minimize-otel-code branch from d90deb0 to b4166e7 on March 10, 2026 03:37
type: string
example: ~
default: "False"
otel_task_runner_span_flush_timeout_millis:
Member

In another config we use the full word “milliseconds” so this should match. That’s the only place we use milliseconds though; elsewhere we always use seconds instead.

Member

No opinion on millis vs msec (other than that a "go duration" style of 1s/300ms would be nice here, but that's out of scope for this PR).

As for seconds vs milliseconds here: it appears the millis is only an artifact of the Python API, and this isn't commonly set as an environment variable in OTEL installs, so we can choose what makes the most sense for us. We could have this as a float number of seconds and convert to the format otel wants.

WDYT @uranusjr ?
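Ash's suggestion of storing a float number of seconds and converting to the milliseconds the otel API expects could be sketched roughly as follows. The helper name and fallback are illustrative only, not what the PR implements:

```python
def flush_timeout_millis(conf_value, fallback_seconds: float = 30.0) -> int:
    """Parse a float-seconds config value and convert it to the integer
    milliseconds expected downstream; fall back when the value is
    missing or unparseable."""
    try:
        seconds = float(conf_value)
    except (TypeError, ValueError):
        seconds = fallback_seconds
    return int(seconds * 1000)

print(flush_timeout_millis("0.3"))  # 300
print(flush_timeout_millis("30"))   # 30000
```

This keeps the user-facing config in seconds, consistent with every other Airflow timeout config the thread mentions, while still feeding millis to the otel call site.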

type: string
example: ~
default: "False"
otel_task_runner_span_flush_timeout_millis:
Member

It's too late to change this now, but the otel_ prefix on all of this is needless :(

Timeout in milliseconds to wait for the OpenTelemetry span exporter to flush pending spans
when a task runner process exits. If the exporter does not finish within this time, any
buffered spans may be dropped.
version_added: 3.1.0
Member

Suggested change
version_added: 3.1.0
version_added: 3.2.0

Comment on lines +1037 to +1039
status_code = StatusCode.OK if state == DagRunState.SUCCESS else StatusCode.ERROR
span.set_status(status_code)
span.end()
Member

@nickstenning @xBis7 Question on the otel/tracing side. What happens if we never emit this DagRun span, or don't emit it for days?

It's possible the dag run is left in the running state -- one case where this can happen is when the dag run is started and then the user pauses the Dag in the UI/API, which means the state of the dag run is never finalized.

Member

(I don't think we can do much about this, just might need to document it)

Contributor

If I understand correctly, we will only see the task spans. The dag run will need to finish so that the span is created and exported. If that happens, then the dag run span will be the parent of the task spans but until then we will only see the children.

TraceContextTextMapPropagator().extract(msg.ti.context_carrier) if msg.ti.context_carrier else None
)
ti = msg.ti
span_name = f"task_run.{ti.task_id}"
Member

@nickstenning Should span names be namespaced too?
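For context on the snippet above: the `context_carrier` that `TraceContextTextMapPropagator` extracts from typically holds a W3C Trace Context `traceparent` entry of the form `version-traceid-spanid-flags`. A minimal stdlib-only parser for that format (a sketch against the public W3C spec, not Airflow or OTel SDK code) looks like:

```python
import re

# W3C traceparent: 2-hex version, 32-hex trace-id, 16-hex parent span-id,
# 2-hex trace-flags, dash-separated.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(carrier):
    """Pull the parent trace/span ids out of a carrier dict, returning
    None when no (or an invalid) traceparent is present -- analogous to
    the `... if msg.ti.context_carrier else None` guard above."""
    header = (carrier or {}).get("traceparent", "")
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    return m.group("trace_id"), m.group("span_id")

carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(parse_traceparent(carrier))
# ('4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7')
```

Extracting a valid parent context is what makes the `task_run.{task_id}` span a child of the dag run span; when the carrier is empty the task span simply becomes a root span.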

from airflow.sdk.configuration import conf

timeout_millis = conf.getint(
"traces", "otel_task_runner_span_flush_timeout_millis", fallback=30000
Member

Does this config need to be specific to task_runner, or could it apply to all places where we flush before exit?
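To illustrate why a bounded flush timeout matters at process exit, here is a stdlib-only sketch of flushing buffered spans on a worker thread with a deadline. The `FakeExporter` is a stand-in for a real span exporter whose network export may stall; none of these names come from the PR:

```python
import threading
import time

class FakeExporter:
    """Stand-in for a span exporter whose export call may be slow."""
    def __init__(self, export_delay: float):
        self.export_delay = export_delay
        self.exported = []

    def export(self, spans):
        time.sleep(self.export_delay)
        self.exported.extend(spans)

def flush_with_timeout(exporter, buffered_spans, timeout_millis: int) -> bool:
    """Run the export on a worker thread and wait at most timeout_millis.
    Returns False when the deadline passes first, in which case the
    buffered spans may be dropped as the process exits."""
    t = threading.Thread(
        target=exporter.export, args=(buffered_spans,), daemon=True
    )
    t.start()
    t.join(timeout=timeout_millis / 1000)
    return not t.is_alive()

fast = FakeExporter(export_delay=0.0)
print(flush_with_timeout(fast, ["span-a"], timeout_millis=500))  # True

slow = FakeExporter(export_delay=2.0)
print(flush_with_timeout(slow, ["span-b"], timeout_millis=50))   # False
```

Nothing in this mechanism is specific to the task runner, which is the point of the question above: the same bounded-wait shape applies anywhere a process flushes spans before exiting.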

Member

@ashb ashb left a comment

A couple of small questions, but looks good to me.

@xBis7
Contributor

xBis7 commented Mar 10, 2026

@dstandish @ashb Please take a look at this #62554 (comment)

I can't unresolve the conversation and the comments are hidden.

Apart from that, it looks good to me.


Labels

area:DAG-processing area:Executors-core LocalExecutor & SequentialExecutor area:Scheduler including HA (high availability) scheduler area:task-sdk area:Triggerer

Projects

None yet

Development


5 participants